-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge feature branch - Application cluster manager #2714
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Refactor the zone-based stoppable check logic and add support to randomly select zone order for the zone-based stoppable check.
Implement the cross-zone-based stoppable check and add to_be_stopped_instances query parameter to the stoppable check API
Add ability for up to 2 nodes with the same logicalId to be added to the cluster at the same time when a SWAP is happening. During all paritionAssignment for WAGED and DelayedAutoRebalancer, we select just one instance for each logicalId. Achieves n -> n+1 for all replicas on SWAP_OUT node and back n when SWAP is marked complete, making it cancelable. Adding and updating Helix Admin APIs to support swap operation: setInstanceOperation addInstance canCompleteSwap completeSwapIfPossible * Refactor sanity checks for HelixAdmin swap APIs. * Helix Node Swap pipeline changes and integration tests. * Fix integration tests to properly restore stopped MockParticipant so following tests are not affected. * Add comments and docstrings. * Fix tests to clean up after themselves. * Optimize duplicate logicalId filtering to only be called on allNodes and then used to remove duplicate logicalIds from enabledLiveNodes. * Add handling for clusterConfig == null in updateSwappingInstances and fix AssignableNode to check for clusterTopologyConfig when attempting to get logicalId. * Fix integ tests. * Fix testGetDomainInformation since we no longer allow an instance to join the cluster with an invalid DOMAIN field. * Add checks to ensure that the SWAP_IN instance has a matching FAULT_ZONE and matching INSTANCE_CAPACITY_MAP to SWAP_OUT node. * Rename canSwapBeCompleted to canCompleteSwap. * Add sanity checks to allow SWAP_IN node to join the cluster in disabled state before SWAP_OUT node has instance operation set. * Fix print in test case. * Add canCompleteSwap to PerInstanceAccessor and fix formatting. * Fix flaky node swap after completion by making sure replica has is computed with logicalIds intead of instanceNames.
…ed blacklisting capabilities (#2687) Enhanced stoppable checks with node evacuation filtering and introduced blacklisting capabilities
… kv pair for result of check or attempt to complete swap. (#2697)
…e picking assignable instances in the cache. (#2702) Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.
… and more predictable during an in-flight node swap. (#2706)
…#2710) Prevent the spectator routing table from containing SWAP_IN instances.
* Change all rebalancer strategies to create Topology without additional non-FaultZone or EndNode levels of the tree. This will allow for swap to work in clusters where the non-FaultZone or EndNode domain kv pairs don't directly match the swapping node.
* Stabilize TestInstanceOperation. clusterVerifier is evaluating to true once the partitionAssignment matches the expected value; however, it is before the TopState is transfered back to the SWAP_IN node. This can be fixed by using TestHelper.verify to check that states converge within TIMEOUT. * Moved evacuate tests with long ST resources to the end because it was taking long time to drop DBs which was causing flakyness in later tests. Ran TestInstanceOperation 5 times locally with success.
zpinto
approved these changes
Dec 20, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
One failing test seems to be flaky.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Issues
#2662
#2655
Description
Merge feature branch ApplicationClusterManager, including the following changes.
Refactor stoppable check logic for enhanced zone analysis (https://github.com/apache/helix/pull/2654[)](https://github.com/apache/helix/commit/23eba7d9fcc4c979bf02f0872412af92343a1e99)
@MarkGaox
MarkGaox authored and Xiaoyuan Lu committed 5 days ago
Implement the cross-zone-based stoppable check (#2680)
@MarkGaox
MarkGaox authored and Xiaoyuan Lu committed 5 days ago
HelixAdmin APIs and pipeline changes to support Helix Node Swap (#2661)
@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Enhanced stoppable checks with node evacuation filtering and introduc…
@MarkGaox
MarkGaox authored and Xiaoyuan Lu committed 5 days ago
Change canCompleteSwap and completeSwapIfPossible to return json with…
@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Expose Evacuate Finished API in Helix-Rest (#2694)
@GrantPSpencer
GrantPSpencer authored and Xiaoyuan Lu committed 5 days ago
Fix WAGED to only use logicalId when computing baseline and centraliz…
@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Make logic to determine state of replicas on SWAP_IN instance simpler…
@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Prevent the spectator routing table from containing SWAP_IN instances.(…
@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Commits on Dec 14, 2023
Build Topology with only required levels (FaultZone and EndNode) (#2713)
Tests
Please refer to original PRs
(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
Changes that Break Backward Compatibility (Optional)
(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
Documentation (Optional)
(Link the GitHub wiki you added)
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)